test(perf): Add Llama-3_1-Nemotron-Ultra-253B-v1 perf tests (cpp)
#4446
Conversation
Pull Request Overview
This PR introduces performance tests for the Llama-3_1-Nemotron-Ultra-253B-v1 model using the cpp TRT backend to ensure that both low- and high-concurrency scenarios pass within CI limits.
- Added new test entries with appropriate parameters (max batch size, input/output lengths, concurrency, etc.) in the QA test list.
- Updated model mapping in test_perf.py to include the new ultra model for both native and Hugging Face identifiers, and appended a build flag when remote code is trusted.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/integration/test_lists/qa/trt_llm_release_perf_test.yml | Added new performance test entries for Llama-3_1-Nemotron-Ultra-253B-v1. |
| tests/integration/defs/perf/test_perf.py | Added new model mapping entries and introduced a build flag for TRUST_REMOTE_CODE_MODELS. |
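The mapping change described above might look roughly like the following sketch. The dict name, helper function, and overall structure are assumptions for illustration; the actual layout of test_perf.py may differ. Only the two checkpoint paths are taken from the review comments below.

```python
# Hypothetical sketch of the model-mapping entries (not the PR's exact
# code): map short test-list model names to checkpoint paths for the
# native (cpp) backend and the Hugging Face backend.
MODEL_PATH_DICT = {
    # native cpp-backend checkpoint path
    "llama_v3.1_nemotron_ultra_253b":
        "nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1",
    # Hugging Face model identifier
    "llama_v3.1_nemotron_ultra_253b_hf":
        "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
}


def resolve_model_path(model_name: str) -> str:
    """Look up a checkpoint path by the short test-list model name."""
    try:
        return MODEL_PATH_DICT[model_name]
    except KeyError:
        raise KeyError(f"unknown perf-test model: {model_name}") from None
```

A lookup through a single dict like this is what lets one test-list entry name (with or without the `_hf` suffix) select the backend-appropriate checkpoint.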
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Pull Request Overview
Adds C++-backend FP8 performance tests for the llama-3.1-nemotron-ultra-253b-v1 model and hooks it into the test runner, including enabling remote code trust for quantized builds.
- New YAML entries in trt_llm_release_perf_test.yml for low/high concurrency FP8 benchmarks
- Model mapping definitions added in test_perf.py for both C++ and HF backends
- Auto-enables --trust_remote_code for models listed in TRUST_REMOTE_CODE_MODELS during build
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/integration/test_lists/qa/trt_llm_release_perf_test.yml | Added four new perf test invocations covering C1–C4 scenarios for the new Ultra-253B model |
| tests/integration/defs/perf/test_perf.py | Registered llama_v3.1_nemotron_ultra_253b[_hf] mappings and added --trust_remote_code flag |
Comments suppressed due to low confidence (3)
tests/integration/defs/perf/test_perf.py:58
- The repository path uses 'Llama-3_1...' with an underscore instead of the dot notation ('Llama-3.1...'), which is inconsistent with other model paths and may break resolution. Update to match the existing naming convention.
"nemotron-nas/Llama-3_1-Nemotron-Ultra-253B-v1",
tests/integration/defs/perf/test_perf.py:105
- The HuggingFace model path uses an underscore in 'Llama-3_1...' instead of 'Llama-3.1...'; this diverges from established naming and may cause lookup failures. Please correct it.
"nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
tests/integration/defs/perf/test_perf.py:932
- Automatically setting --trust_remote_code=True can pose security risks if unreviewed code is pulled. Ensure these models are audited or document why remote code trust is safe here.
if self._config.model_name in TRUST_REMOTE_CODE_MODELS:
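One way to address the reviewer's concern (purely illustrative, not code from the PR) is to keep the trusted set explicit and small, so the flag is appended only for models that have been audited and never by default:

```python
# Hypothetical guard sketch (not the PR's code): remote-code trust is
# gated behind an explicit, audited allow-list, so an unreviewed model
# name never gets the flag appended.
AUDITED_TRUST_REMOTE_CODE_MODELS = frozenset({
    "llama_v3.1_nemotron_ultra_253b",
    "llama_v3.1_nemotron_ultra_253b_hf",
})


def trust_remote_code_flags(model_name: str) -> list:
    """Return ['--trust_remote_code'] only for audited models."""
    if model_name in AUDITED_TRUST_REMOTE_CODE_MODELS:
        return ["--trust_remote_code"]
    return []
```

Returning the flag from a pure helper like this also keeps the behavior easy to unit-test, which is one way to document why remote-code trust is considered safe for these specific models.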
/bot run --disable-fail-fast
PR_Github #6157 [ run ] triggered by Bot
PR_Github #6157 [ run ] completed with state
Description
Add llama-3.1-nemotron-ultra-253b-v1 perf test coverage.

This only adds the cpp backend for fp8. A model this large is rarely run in bf16 precision, so fp8 is the only resource-efficient configuration to test. The PyTorch backend in fp8 requires pre-quantized fp8 checkpoints, which we currently do not have added.
Invariants
- max_batch_size
- trtllm-bench

Four sequence profiles were benchmarked:
- reqs = 8, con = 1
- reqs = 250, con = 250

Performance Summary
[Benchmark tables not recoverable: throughput (tok/s), latency (ms), and latency percentiles for Concurrency = 1 (C1 & C2) and Concurrency = 250 (C3 & C4).]
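The latency-percentile figures reported in summaries like the one above are typically computed from per-request latency samples. Below is a minimal nearest-rank percentile sketch; the sample latencies and helper name are invented for illustration and are not the PR's benchmark data.

```python
import math


def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: the smallest sample value whose rank
    covers at least p percent of the sorted samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank - 1, 0)]


# Invented example latencies in milliseconds (not real benchmark data).
latencies_ms = [102.0, 98.5, 99.9, 101.1, 105.7, 110.2, 120.3, 97.4]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
```

Real benchmark harnesses such as trtllm-bench aggregate these statistics per concurrency level, which is why the summary above reports separate percentile rows for concurrency 1 and 250.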